Corpus Creation for New Genres: A Crowdsourced Approach to PP Attachment
Authors
Abstract
This paper explores the task of building an accurate prepositional phrase attachment corpus for new genres while avoiding a large investment in terms of time and money by crowdsourcing judgments. We develop and present a system to extract prepositional phrases and their potential attachments from ungrammatical and informal sentences and pose the subsequent disambiguation tasks as multiple choice questions to workers from Amazon’s Mechanical Turk service. Our analysis shows that this two-step approach is capable of producing reliable annotations on informal and potentially noisy blog text, and this semi-automated strategy holds promise for similar annotation projects in new genres.
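The second step described above, posing each disambiguation as a multiple choice question and collecting worker judgments, can be sketched roughly as follows. This is a minimal illustration, not the authors' system: the `AttachmentQuestion` class, the `aggregate` function, and the plurality-vote rule are all hypothetical assumptions, since the abstract does not say how worker answers are combined or how questions are formatted.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class AttachmentQuestion:
    """One PP attachment task (hypothetical structure, not from the paper)."""
    sentence: str
    pp: str                 # the prepositional phrase to attach
    candidates: List[str]   # potential attachment sites found by the extractor

    def render(self) -> str:
        """Format the task as a plain-text multiple choice question."""
        lines = [
            f"Sentence: {self.sentence}",
            f'Which word or phrase does "{self.pp}" attach to?',
        ]
        lines += [f"  ({i}) {c}" for i, c in enumerate(self.candidates, start=1)]
        return "\n".join(lines)


def aggregate(question: AttachmentQuestion,
              answers: Sequence[int]) -> Tuple[Optional[str], float]:
    """Combine worker answers (0-based candidate indices) by plurality vote.

    Returns (winning candidate, share of votes it received). The candidate
    is None when there are no answers or the top two choices are tied.
    Plurality voting is an assumption made for this sketch.
    """
    if not answers:
        return None, 0.0
    ranked = Counter(answers).most_common(2)
    top_index, top_votes = ranked[0]
    agreement = top_votes / len(answers)
    if len(ranked) > 1 and ranked[1][1] == top_votes:
        return None, agreement  # tie: no reliable label
    return question.candidates[top_index], agreement


question = AttachmentQuestion(
    sentence="I saw the man with the telescope",
    pp="with the telescope",
    candidates=["saw", "the man"],
)
print(question.render())
print(aggregate(question, [0, 0, 1, 0, 0]))
```

Returning the agreement share alongside the label makes it possible to flag low-agreement items for review, which may matter on noisy blog text.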
Similar resources
Towards Semi-Automated Annotation for Prepositional Phrase Attachment
This paper investigates whether high-quality annotations for tasks involving semantic disambiguation can be obtained without a major investment in time or expense. We examine the use of untrained human volunteers from Amazon’s Mechanical Turk in disambiguating prepositional phrase (PP) attachment over sentences drawn from the Wall Street Journal corpus. Our goal is to compare the performance of...
Corpus Based PP Attachment Ambiguity Resolution with a Semantic Dictionary
This paper deals with two important ambiguities of natural language: prepositional phrase attachment and word sense ambiguity. We propose a new supervised learning method for PP attachment based on a semantically tagged corpus. Because no sufficiently large sense-tagged corpus exists, we also propose a new unsupervised context-based word sense disambiguation algorithm which amends the tra...
Disambiguation of English PP Attachment using Multilingual Aligned Data
Prepositional phrase attachment (PP attachment) is a major source of ambiguity in English. It poses a substantial challenge to Machine Translation (MT) between English and languages that are not characterized by PP attachment ambiguity. In this paper we present an unsupervised, bilingual, corpus-based approach to the resolution of English PP attachment ambiguity. As data we use aligned linguist...
Creating an Appropriate Corpus for PP Attachment Training
This paper describes work in progress that is identifying shortcomings of existing Prepositional Phrase (PP) attachment algorithms and producing a new resource derived from the Penn TreeBank (PTB) corpus. The aim is to use this new resource (PTB Prime) to improve the accuracy of PP attachment algorithms and use this in an existing text processing system (LaSIE-II).
Strategies Used in the Translation of Interlingual Subtitling
This study was an attempt to identify the interlingual strategies employed to translate English subtitles into Persian and to determine their frequency. Unlike in many countries, subtitling is a new field in Iran. The study, a corpus-based, comparative, descriptive, non-judgmental analysis of an English-Persian parallel corpus, comprised English audio scripts of five movies of differ...